Recognizing Unregistered Names for Mandarin Word Identification

Authors

  • Liang-Jyh Wang
  • Wei-Chuan Li
  • Chao-Huang Chang
Abstract

Word Identification has been an important and active issue in Chinese Natural Language Processing. In this paper, a new mechanism, based on the concept of sublanguage, is proposed for identifying unknown words, especially personal names, in Chinese newspapers. The proposed mechanism includes title-driven name recognition, adaptive dynamic word formation, and identification of 2-character and 3-character Chinese names without title. We will show the experimental results for two corpora and compare them with the results of NTHU's statistic-based system, the only system that we know has attacked the same problem. The experimental results have shown significant improvements over the WI systems without the name identification capability.

1 Introduction

Word Identification (WI, also known as Segmentation) has been an important and active issue in Chinese Natural Language Processing. Various approaches have been proposed for this problem [1], such as the MM (Maximum Matching) method [8], the RMM (Reverse Directional Maximum Matching) method, the OM (Optimum Matching) method, statistical approaches [5], and unification approaches [12]. However, there are still a number of problems to conquer on the way to a satisfactory WI system. Among them are a clear definition of Chinese words, an objective evaluation suite with appropriate corpora, and the processing of unknown words (such as personal names, place names, and organization names).

In this paper, we will deal with the problem of unknown words, especially personal names, although the proposed approach can be easily extended to cover place names and organization names. According to Chang et al. [2], proper nouns (which compose a major part of unknown words) account for more than fifty percent of the errors made by a typical system. Thus, successful processing of proper nouns is essential for a satisfactory WI system. Almost all WI systems use a lexicon to guide the segmentation process.
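As a minimal sketch of lexicon-guided segmentation, the following illustrates a forward Maximum Matching pass: at each position it takes the longest lexicon word that matches, and falls back to a single character otherwise. The function name, the `max_word_len` parameter, and the toy lexicon are assumptions made for illustration; they are not taken from the paper or from the cited MM system [8].

```python
def forward_maximum_matching(text, lexicon, max_word_len=4):
    """Greedy left-to-right segmentation using the longest lexicon match."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"中文", "自然", "語言", "自然語言", "處理"}

print(forward_maximum_matching("中文自然語言處理", toy_lexicon))
# A personal name that is not in the lexicon falls apart into single characters.
print(forward_maximum_matching("王小明處理", toy_lexicon))
```

The second call shows the unknown-word problem in miniature: because the name is absent from the lexicon, a purely lexicon-driven segmenter can only emit its characters one by one.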
In fixed domains such as a classical novel or technical texts, we can put all possible words in the lexicon and avoid the unknown-word problem. However, in a dynamic domain such as newspapers, it is impossible to enumerate all possible words in advance. For example, some personal names, such as those of suspects or victims, often appear in only one day's news. Thus, recognition of these personal names and other unknown words is very important.

Chang et al. [2] (at National Tsing-Hua University, Hsinchu, Taiwan) proposed a Multiple-Corpus approach to solve the problem. They consider the WI problem as a constraint satisfaction problem (CSP) and use a number of corpora to train their statistic-based system. The probabilities of each Chinese character as a surname, as the first character, and as the second character in a first name are computed based on the training. Using these statistics, two-character and three-character personal names are proposed to compete with the words in the lexicon. Then, a dynamic programming technique is used to decide the most probable solution to the CSP. They reported a 90 percent average correct rate of surname-name identification. To the best of our knowledge, this is the only group that has proposed a solution to the problem. Chang's approach is completely statistic-based and easy to implement. However, we argue that syntactic and semantic information must be considered in a successful WI system.

2 A Sublanguage Approach

The concept of sublanguages (i.e., languages in restricted domains) has been considered very important in natural language processing [6, 7]. A sublanguage usually has its own special syntax, semantics, and style, which are more restricted compared with the language as a whole. In this paper, we will show how the study of a sublanguage can help in identifying names and forming them in a dynamic, adaptive way.
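For contrast with the sublanguage approach, the statistical name proposal attributed to Chang et al. [2] in Section 1 can be illustrated with a rough sketch. It scores a candidate by multiplying per-position character probabilities and proposes candidates above a threshold. All probability values, the `UNSEEN` floor, the threshold, and the handling of two-character names are assumptions made for illustration, not figures or design decisions from either paper. The dynamic programming step that lets candidates compete with lexicon words is omitted.

```python
# Hypothetical toy probabilities; real values would be estimated from training corpora.
SURNAME_PROB = {"王": 0.07, "李": 0.07, "張": 0.06}
GIVEN_FIRST_PROB = {"小": 0.02, "大": 0.015}
GIVEN_SECOND_PROB = {"明": 0.03, "華": 0.025}
UNSEEN = 1e-4  # assumed floor for given-name characters not seen in training


def name_score(candidate):
    """Product of per-position probabilities for a 2- or 3-character candidate."""
    if len(candidate) == 3:
        surname, first, second = candidate
        return (SURNAME_PROB.get(surname, 0.0)
                * GIVEN_FIRST_PROB.get(first, UNSEEN)
                * GIVEN_SECOND_PROB.get(second, UNSEEN))
    if len(candidate) == 2:
        surname, given = candidate
        # Assumption: a single given-name character is scored with the
        # second-character table; the source does not specify this case.
        return SURNAME_PROB.get(surname, 0.0) * GIVEN_SECOND_PROB.get(given, UNSEEN)
    return 0.0


def propose_names(text, threshold):
    """Return (start, candidate) pairs whose score reaches the threshold."""
    proposals = []
    for i in range(len(text)):
        for length in (2, 3):
            candidate = text[i:i + length]
            if len(candidate) == length and name_score(candidate) >= threshold:
                proposals.append((i, candidate))
    return proposals


print(propose_names("王小明處理", threshold=1e-5))
```

In a complete system of this kind, the proposals would be added to the candidate word lattice and a search over the lattice would pick the most probable segmentation.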


Similar resources

Word Identification For Mandarin Chinese Sentences

Chinese sentences are composed of strings of characters without blanks to mark words. However, the basic unit for sentence parsing and understanding is the word. Therefore, the first step of processing Chinese sentences is to identify the words. The difficulties of identifying words include (1) the identification of complex words, such as Determinative-Measure, reduplications, derived words etc., (...


Unregistered Biological Words Recognition by Q-Learning with Transfer Learning

Unregistered biological word recognition is the process of identifying terms that are out of vocabulary. Although many approaches have been developed, their performance is not satisfactory. As the identification process can be viewed as a Markov process, we put forward a Q-learning with transfer learning algorithm to detect unregistered biological words from texts. With the Q-le...


Perceptual confusability of Mandarin sounds, tones and syllables

This paper reports a perceptual identification study for Mandarin sounds, tones and whole syllables, using phonotactically plausible non-word stimuli covered in white noise. The results showed that while the accuracy of whole-syllable identification could be estimated by the independent accuracy of initial and final identification, syllable-level confusability patterns were related to, but not ...


The interlanguage speech intelligibility benefit for native speakers of Mandarin: Production and perception of English word-final voicing contrasts

This study investigated the intelligibility of native and Mandarin-accented English speech for native English and native Mandarin listeners. The word-final voicing contrast was considered (as in minimal pairs such as 'cub' and 'cup') in a forced-choice word identification task. For these particular talkers and listeners, there was evidence of an interlanguage speech intelligibility benefit for ...


Recognizing the emotional valence of names: an ERP study.

Unlike common nouns, person names refer to unique entities and generally have a referring function. We used event-related potentials to investigate the time course of identifying the emotional meaning of nouns and names. The emotional valence of names and nouns were manipulated separately. The results show early N1 effects in response to emotional valence only for nouns. This might reflect auto...



Publication date: 1992